Abstract

A common  complaint  when  dealing  with  the 
performance  of  computationally  intensive  scientific 
applications  on  parallel  computers  is  that  programs  exist 
to  predict  the  performance  of  radar  systems,  missiles  and 
artillery  shells,  drugs,  etc.,  but  no  one  knows  how  to 
predict  the  performance  of  these  applications  on  a 
parallel  computer.  Actually,  that  is  not  quite  true.  A 
more  accurate  statement  is  that  no  one  knows  how  to 
predict  the  performance  of  these  applications  on  a 
parallel  computer  in  a reasonable  amount  of  time. 
PENVELOPE  is  an  attempt  to  remedy  this  situation.  It  is 
an  extension  to  Amdahl’s  Law/Gustafson’s  work  on 
scaled  speedup  that  takes  into  account  the  cost  of 
interprocessor  communication  and  operating  system 
overhead,  yet  is  simple  enough  that  it  was  implemented  as 
an  Excel  spreadsheet. 

1.  Introduction 

A common  complaint  when  dealing  with  the 
performance  of  computationally  intensive  scientific 
applications  on  parallel  computers  is  that  programs  exist 
to  predict  the  performance  of  radar  systems,  missiles  and 
artillery  shells,  drugs,  etc.,  but  no  one  knows  how  to 
predict  the  performance  of  these  applications  on  a parallel 
computer.  Actually,  that  is  not  quite  true.  A more 
accurate  statement  is  that  no  one  knows  how  to  predict 
the  performance  of  these  applications  on  a parallel 
computer  in  a reasonable  amount  of  time.  We  are 
developing  a fast  model  of  the  performance  of  these 
applications on modern parallel computer architectures.
As the first step in the process, we have chosen to model
pure MPI applications. In general, our approach works
best for programs using a single communicator. However,
it should be possible to use PENVELOPE to model more
complicated applications.

The model uses “Back-of-the-Envelope” methods
and relies on either measured or predicted (e.g., using the
ENVELOPE model developed jointly at the US Army
Research Laboratory and the University of Tennessee,
Knoxville[1]) run times for single processor runs. It also
relies  on  measured  numbers  of  calls  and  the  amount  of 
data  transferred  for  each  of  the  commonly  used  MPI-1 
calls  (e.g.,  send,  receive,  and  the  more  commonly  used 
collective  operations).  This  of  course  means  taking  the 
parallel  measurements  for  each  number  of  processors  one 
is  interested  in.  However,  the  model  assumes  that  these 
numbers  will  be  system  independent,  so  the 
measurements  only  need  to  be  made  once  per  application. 
In  contrast,  the  serial  run  times  will  be  needed  for  each  of 
the  systems  to  be  modeled. 

In addition to the application specific data, one also
needs system specific data concerning peak internode
bandwidth, peak interprocessor/internode bandwidth with
only a single sender/receiver pair, and the minimum
latency for passing a one byte message. All of these
numbers  of  course  assume  one  is  using  MPI.  While  this 
is  a lot  of  information  to  collect,  fortunately  it  only  needs 
to  be  collected  once  per  system.  Many  vendors  and  some 
supercomputer  sites  (e.g.,  Oak  Ridge  National 
Laboratory)  routinely  publish  this  information.  Once 
collected,  it  can  be  applied  to  the  simulation  of  the 
performance  of  any  number  of  applications. 
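
As a rough illustration of how these system specific numbers might be combined, the sketch below estimates the cost of a group of messages with the familiar latency plus size-over-bandwidth approximation. The formula, the even sharing of the internode bandwidth among concurrent pairs, and every name used here (SystemParams, message_time, and their fields) are assumptions made for the example. The text does not state the exact expressions used in the PENVELOPE spreadsheets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemParams:
    """System specific inputs (hypothetical field names)."""
    latency_s: float        # minimum latency for a one byte message, seconds
    pair_bandwidth: float   # peak bandwidth, single sender/receiver pair, bytes/s
    node_bandwidth: float   # peak internode bandwidth, bytes/s


def message_time(system: SystemParams, n_messages: int, total_bytes: float,
                 concurrent_pairs: int = 1) -> float:
    """Estimate the time to pass n_messages carrying total_bytes in all.

    Assumed model: every message pays the latency once, and the data moves
    at the single-pair bandwidth unless the pairs sharing a node would
    exceed the node's peak bandwidth, in which case they split it evenly.
    """
    if concurrent_pairs < 1:
        raise ValueError("concurrent_pairs must be at least 1")
    shared = system.node_bandwidth / concurrent_pairs
    bandwidth = min(system.pair_bandwidth, shared)
    return n_messages * system.latency_s + total_bytes / bandwidth
```

Because the message counts and sizes are treated as system independent, the same call can be repeated with a different SystemParams value for each system being compared.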

The application specific and system specific
information is then entered into a series of Microsoft
Excel spreadsheets that our team has put together (one
spreadsheet per application-number of processors
pairing). While it can take some time to initially collect
this data, the time required for the spreadsheet to perform
its calculations is negligible on today’s PCs. This paper
will discuss our experimental results based on the NAS
benchmarks[2].

This  project  was  supported  by  a grant  of  computer 
time  from  the  DoD  High  Performance  Computing 
Modernization  Program. 

2.  The  Problem 

How  does  one  predict  the  performance  of  a 
parallelized  scientific  application  on  a computer?  In 
general,  one  is  not  interested  in  just  one  computer. 
Rather,  there  are  likely  to  be  multiple  competing  systems 
and  the  user  is  trying  to  decide  which  system(s)  to  request 
access  to.  Frequently,  the  system  staff  is  attempting  to 
evaluate  competing  bids  from  multiple  vendors,  possibly 
for  systems  that  will  not  be  generally  available  for  several 
more  months.  Complicating  matters  further,  one  may 
have  multiple  applications,  multiple  data  sets  per 
application,  and  almost  certainly  will  want  the 
applications run for varying numbers of processors.

3.  The  Traditional  Solutions 

The  most  common  answer  to  this  problem  is  to 
measure  the  performance.  This  of  course  takes  time  and 
in  many  cases  requires  considerable  resources. 
Additionally,  when  talking  about  one  of  a kind  systems 
and/or  systems  that  are  still  being  developed,  this  is  not 
even  a possibility.  The  most  common  solutions  to  this 
problem  have  been: 

• Wait  until  the  systems  are  available  and  then  run 
your  benchmarks. 

• Rely on industry standard benchmarks such as
Linpack[3], STREAM[4], or the NPB benchmark
suite out of NASA Ames Research Center.

• Extrapolate  from  runs  made  on  the  previous 
generation  of  hardware  from  the  same  vendor 
and  hope  for  the  best. 

As  was  demonstrated  in  Reference  5,  there  can  be  a 
considerable  degree  of  variability  in  the  delivered  levels 
of  performance  for  each  of  these  benchmarks.  So  which 
if  any  of  them  should  one  use?  Obviously  what  is  needed 
is  an  entirely  different  approach. 

4.  Back-of-the-Envelope  Calculations 

A promising concept for an alternative approach is to
use Back-of-the-Envelope calculations. Engineers and
scientists have been using this approach for decades if not
centuries to predict the outcome of their work prior to
committing themselves to a particular design or costly
experiment. The key concept is to remove enough of the
fluff that the equations can be readily solved while not
eliminating any of the details that really matter.
Amdahl’s law and the work of Gustafson on scaled
speedup are prime examples of this approach at work.
While in many cases these approaches are good enough,
they will fail in two key areas:

• Neither will tell you if the program has been
poorly parallelized (e.g., the granularity is too
fine).

• Both lack any insights as to the importance of
the system interconnect. For all they care, a 1
Byte/second interconnect is just as good as a 1
GB/second interconnect. Similarly, an
interconnect with a 1 minute message passing
latency is of equal value to one with a 1
microsecond message passing latency.

What is needed is an extension that incorporates both
Amdahl’s law and the work of Gustafson, while making a
limited effort to take into account the design of the
program and the systems it will run on.
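
For reference, the two starting points can be written in a few lines. This is a minimal sketch of the textbook forms of Amdahl's law and Gustafson's scaled speedup, with function and parameter names chosen for the example. Note that neither function takes an argument describing the interconnect, which is exactly the second shortcoming listed above.

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Amdahl's law: speedup for a fixed problem size.

    serial_fraction is the fraction of the single processor run time
    that cannot be parallelized.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


def gustafson_speedup(serial_fraction: float, n_procs: int) -> float:
    """Gustafson's scaled speedup: the problem grows with n_procs.

    Here serial_fraction is the fraction of the parallel run time
    spent in serial work.
    """
    return n_procs - serial_fraction * (n_procs - 1)
```

With a serial fraction of 0.05 on 16 processors, the first form gives a speedup of 64/7 (about 9.14) and the second gives 15.25, regardless of whether the interconnect delivers 1 Byte/second or 1 GB/second.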

In the general case, this problem may be too
complicated to solve using this approach. However, it
was felt that if one limited the problem to a commonly
occurring case or set of cases, then it might be possible to
achieve usable results with the desired degree of effort.
The obvious choice was programs parallelized using the
more commonly used MPI-1 features with a single
communicator. With some effort on the part of the user,
it should be possible to extend the model to cover
programs with multiple communicators. At this point in
the development of PENVELOPE, no attempt has been
made to investigate this possibility.

5.  General  Approach


Based on data collected using TAU[6,7,8], or other
similar utilities, the cost for the data communications is
estimated. This means that on at least one system an
instrumented run must be made for each number of
processors being used. As previously mentioned, one
will also need information concerning the bandwidth and
latency for the system interconnects. Provisions have
been made for specifying the cost of the I/O, operating
system overhead, and whether or not communications are
overlapped with computations. The model assumes that
collective communications are never overlapped with
computations. Sends and Receives may or may not be
overlapped. Currently this is a simple yes or no question,
with no provisions for partial overlapping of
communications with computations. The cost of the
computations is estimated using Amdahl’s law for fixed
problem sizes, or Gustafson’s work for problems
involving scaled speedup.
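
The pieces described in this section might be combined along the following lines. This is only a sketch for the fixed problem size case: the way the terms are added, the use of max() to represent overlapped Sends and Receives, and all of the names are assumptions made for illustration, not the formulas in the PENVELOPE spreadsheets. The scaled speedup case is omitted.

```python
def amdahl_compute_time(serial_time: float, serial_fraction: float,
                        n_procs: int) -> float:
    """Computation time on n_procs for a fixed problem size (Amdahl's law)."""
    return serial_time * (serial_fraction + (1.0 - serial_fraction) / n_procs)


def predicted_run_time(serial_time: float, serial_fraction: float,
                       n_procs: int, collective_time: float,
                       p2p_time: float, overlap_p2p: bool,
                       io_time: float = 0.0,
                       os_overhead: float = 0.0) -> float:
    """Back-of-the-envelope run time estimate (hypothetical combination)."""
    compute = amdahl_compute_time(serial_time, serial_fraction, n_procs)
    if overlap_p2p:
        # Yes/no overlap: assume Sends and Receives hide behind computation,
        # so only the longer of the two contributes.
        busy = max(compute, p2p_time)
    else:
        busy = compute + p2p_time
    # Collective communications are never overlapped with computation.
    return busy + collective_time + io_time + os_overhead
```

Setting collective_time and p2p_time to zero reduces the estimate to Amdahl's law alone, which makes it easy to see how much of a prediction comes from the communication terms.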

6.  Limitations 

There  are  four  main  limitations  with  the  model  at  the 
present  time.  The  first  one  is  difficult  to  know  how  to 
handle.  When  discussing  an  application’s  performance, 
one  would  like  to  think  in  terms  of  there  being  a single 
value  for  the  run  time.  In  reality,  there  will  always  be  a 
range of values. If this range is small enough (e.g.,
1-10%), then it probably does not matter. However, in a
recent  effort  to  benchmark  some  of  the  systems  at  the 
ARL-MSRC,  the  following  degrees  of  variability  were 
observed. 


[Figure: Intel Pentium 4 3.06 GHz (6.12 GFLOPS) Cluster with
Myrinet 2000 Switch (2 PE per node). Summary of the
Frequency Counts for (Worst Time/Best Time) for All Classes]
Figure 1. The variability in run times on the Intel
Pentium 4 cluster at the ARL-MSRC (0.1 increment
binning)

[Figure: IBM SP 375 MHz (1.5 GFLOPS) Power3 (NH2) with
Colony Switch (single rail, 16 PE/node). Summary of
the Frequency Counts for (Worst Time/Best Time) for
All Classes]
Figure 2. The variability in run times on the IBM SP
Power3 (NH2) at the ARL-MSRC (0.1 increment
binning)

[Figure: 512 PE SGI O3K 400 MHz (800 MFLOPS), 64 bit
compilers, 16 KB page size. Summary of the
Frequency Counts for (Worst Time/Best Time) for All
Classes]
Figure 3. The variability in run times on the SGI Origin
3000 at the ARL-MSRC (0.1 increment binning)
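
The quantity summarized in these figures can be computed as follows. This is a minimal sketch, assuming the repeated run times for each benchmark/processor-count case are held in a dictionary; the function name and the data layout are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def variability_histogram(run_times: Dict[str, List[float]]) -> Dict[float, int]:
    """Count cases by (worst time / best time), binned in 0.1 increments.

    Each key of the result is the lower edge of a bin, e.g. 1.2 covers
    ratios from 1.2 up to (but not including) 1.3.
    """
    counts: Counter = Counter()
    for times in run_times.values():
        if len(times) < 2:
            continue  # a single run says nothing about variability
        ratio = max(times) / min(times)
        # The small offset guards against ratios that land a hair below a
        # bin edge because of floating point rounding.
        tenths = math.floor(ratio * 10 + 1e-9)
        counts[tenths / 10] += 1
    return dict(sorted(counts.items()))
```

A case whose ratio falls in the 1.0 bin varies by less than 10%, the range that the text above treats as small enough not to matter.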


The  second  problem  is  that  many  systems  have 
insufficient  memory  bandwidth  in  a node  to  peg  the 
interface  on  all  of  the  processors  at  the  same  time. 
Therefore,  when  going  from  a partially  filled  node  to  a 
fully  utilized  node,  there  may  be  a significant  decrease  in 
the  per  processor  level  of  performance.  The  third 
problem  is  superlinear  speedup,  where  as  the  processor 
count  increases  the  amount  of  work  per  processor 
decreases  to  the  point  that  the  working  set  fits  in  cache. 
A side  effect  of  this  is  that  once  the  program  enters  the 
region  of  superlinear  speedup,  the  demands  on  the 
memory  system  can  drop  markedly.  In  many  cases,  this 
will  eliminate  the  limitations  discussed  in  the  second 
problem. 

The fourth problem is related to the second and third
problems. What value should one use for the serial
run time? For large problems, it may not even be possible
to run the application on a single processor. Assuming
that it is possible to run the application on a single
processor, is it best to use the measured serial run time,
the measured run time for a small number of processors
multiplied by that number of processors (this might
eliminate some of the errors associated with problems 2
and 3), or a model such as ENVELOPE to estimate the
serial performance based on more appropriate assumptions
for per processor memory bandwidth and cache hit rates
when using N processors? Currently, for 4-9 processors,
the measured serial run time was used in our experiments
whenever it was available. For larger numbers of
processors, four times the four-processor run time was used
consistently. In some cases, a better choice would have
been to use the minimum of these two values, while the
best solution is likely to be modeling the serial
performance.
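
The selection rule just described can be written down compactly. The sketch below is an interpretation rather than the spreadsheet logic: the function name, the fallback used when a measurement is missing, and the use_minimum switch (which implements the "minimum of these two values" alternative) are assumptions made for the example.

```python
from typing import Optional


def effective_serial_time(n_procs: int,
                          measured_serial: Optional[float],
                          four_proc_time: Optional[float],
                          use_minimum: bool = False) -> float:
    """Pick the serial run time to feed into the model."""
    scaled = 4.0 * four_proc_time if four_proc_time is not None else None
    if use_minimum and measured_serial is not None and scaled is not None:
        return min(measured_serial, scaled)
    # Up to 9 processors: prefer the measured serial run time when available.
    if n_procs <= 9 and measured_serial is not None:
        return measured_serial
    # Larger processor counts: four times the four-processor run time.
    if scaled is not None:
        return scaled
    # Assumed fallback: use whatever measurement exists.
    if measured_serial is not None:
        return measured_serial
    raise ValueError("no run time available to estimate the serial time")
```

When superlinear speedup is present, four times the four-processor run time can be smaller than the measured serial run time, which is why the minimum is sometimes the better choice.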

7.  Results


In an attempt to jump start the process, data from
References 9, 10, and 11 were used to calculate the
communication costs. Hardware information came from
vendor websites, vendor presentations at Supercomputing
2003 and other conferences, and numerous publications
out of Oak Ridge National Laboratory and the Ohio
Supercomputer Center/Ohio State University. At the
present time, we are assuming that the BT, CG, EP, FT,
and LU benchmarks overlap computation with
communication, while the MG and SP benchmarks do not
overlap computation with communication. It now appears
as though on some systems the MG benchmark may be
able to overlap computation with communication
(probably due to the buffering of messages). Benchmark
data for the class W, A, and B benchmarks were collected
for four systems at the US Army Research
Laboratory-Major Shared Resource Center (ARL-MSRC).
This data was supplemented with results published by the
NAS group at NASA Ames Research Center and roughly
40 other sites on the web, along with correspondence with
some of the vendors and two other supercomputing sites.

Based  on  this  data,  the  run  time  was  estimated  for  a 
large  number  of  combinations  of  system,  number  of 
processors  (in  some  cases  out  to  64  processors),  and 
benchmarks  for  the  class  A data  sets.  A smaller  number 
of  combinations  were  estimated  for  the  class  B and  W 
data  sets  due  to  the  more  limited  amount  of  information 
available  for  modeling  these  data  sets.  When  sufficient 
information  existed,  the  predicted  run  times  were 
compared  to  the  measured  runtimes  (using  the  best  times 
when more than one measurement was made). Similarly,
the run times predicted using Amdahl’s law were
compared to the measured run times. Figures 4 and 5
show these results. In many cases there was little
difference between the two sets of predictions due to the
overlapping of communication with computation. Where
there was a strong degree of superlinear speedup, neither
model worked well, but PENVELOPE appears to be
worse. In cases where the communication costs were
large, PENVELOPE is significantly more accurate than
using Amdahl’s law by itself. In a number of cases where
neither cache effects nor communications costs were
large, Amdahl’s law tended to underestimate the run time
by 10-30% while PENVELOPE would overestimate the
run time by a similar amount.


[Histogram axes: Frequency Count versus Measured/Predicted]

Figure 4. The accuracy of PENVELOPE in estimating
the run time for selected runs of the NAS benchmark
suite (0.1 increment binning)

Figure 5. The accuracy of Amdahl’s law in estimating
the run time for selected runs of the NAS benchmark
suite (0.1 increment binning)

8.  Conclusions

PENVELOPE is a work in progress. As such, it
shows significant promise in achieving its goals.
However, at the present time, some of those goals are
only partially achieved. It is hoped that continued work
will rapidly improve the quality of the predictions.